Clustering

Put simply, clustering places observations that seem similar into the same cluster. It is commonly applied to two-dimensional data, where the goal is to form clusters based on coordinates. Here, we will do something similar: we will cluster houses by their latitude-longitude locations using several different clustering methods.

We will start by getting some data: a dataset of more than 20,000 California houses. We will then examine whether housing prices correlate directly with map location.

We will use the VegaLite package for plotting. This package makes it very easy to plot information on a map; all you need is a JSON file of the map you intend to draw. Here, we will use the California counties JSON file, plot each house on the map, and color each point by its price, heatmap style. The color encoding is set by this line: color="median_house_value:q"
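To make the color encoding concrete, here is a minimal sketch that leaves out the county map and draws a plain scatter plot instead. It assumes a DataFrame with longitude, latitude, and median_house_value columns; the three rows are made-up placeholders, not values from the housing dataset.

```julia
using DataFrames, VegaLite

# Placeholder rows for illustration only; the notebook itself works on the
# full California housing dataset.
df = DataFrame(
    longitude = [-122.0, -118.0, -121.0],
    latitude = [37.0, 34.0, 38.0],
    median_house_value = [400_000.0, 250_000.0, 150_000.0],
)

# Each house becomes a circle. The ":q" suffix marks a field as quantitative,
# so the color channel maps house value onto a continuous color scale.
plt = df |> @vlplot(
    :circle,
    x = "longitude:q",
    y = "latitude:q",
    color = "median_house_value:q",
)
```

Changing the field named in the color argument changes what the heatmap shows; the positions stay the same.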

Note that the cell above may take a few minutes to run!

One thing we will explore in this notebook is whether clustering the houses has any direct relationship with their prices. To do so, we will bucket the house prices into intervals of $50,000 and redo the color coding based on each bucket.
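The bucketing step can be sketched as follows. The helper name price_bucket and the example prices are hypothetical, chosen only to show the arithmetic.

```julia
# Hypothetical helper: map a price to the index of its $50,000 bucket.
# Bucket 1 covers [0, 50_000), bucket 2 covers [50_000, 100_000), and so on.
price_bucket(price; width = 50_000) = floor(Int, price / width) + 1

prices = [14_999.0, 50_000.0, 123_456.0, 500_001.0]  # made-up example prices
buckets = price_bucket.(prices)                      # one bucket index per house
```

Coloring by the bucket index instead of the raw price turns the continuous heatmap into a small number of discrete price bands, which are easier to compare against cluster labels.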

🟤K-means clustering
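As a minimal sketch of k-means on coordinates, assuming the Clustering package is installed: the six points below are made up and form two well-separated groups, so the expected split is easy to see. The real data would be the houses' longitude and latitude columns, arranged so that each column of the matrix is one house.

```julia
using Clustering

# Made-up 2-D points (one observation per column).
# Row 1 plays the role of longitude, row 2 of latitude.
X = [-122.0 -122.1 -121.9 -118.0 -118.1 -117.9;
       37.0   37.1   36.9   34.0   34.1   33.9]

# kmeans expects a d×n matrix and the number of clusters k.
result = kmeans(X, 2)

labels = assignments(result)  # cluster index (1..k) for each column of X
centers = result.centers      # d×k matrix of cluster centroids
```

The cluster labels can then replace the price as the color field in the plot, so the two colorings can be compared side by side.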

Yes, location affects the price of a house, but location here means things like proximity to water, proximity to downtown, proximity to a bus stop, and so on.

Let's see if this remains true for the other clustering methods.

🟤K-medoids clustering

For this type of clustering, we need to build a distance matrix. We will use the Distances package for this purpose and compute the pairwise Euclidean distances.
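A minimal sketch of this step, assuming the Distances and Clustering packages and using made-up points rather than the housing data:

```julia
using Clustering, Distances

# Made-up 2-D points (one observation per column), two well-separated groups.
X = [-122.0 -122.1 -121.9 -118.0 -118.1 -117.9;
       37.0   37.1   36.9   34.0   34.1   33.9]

# Pairwise Euclidean distances between the columns of X (dims = 2)
# give a symmetric n×n matrix with zeros on the diagonal.
D = pairwise(Euclidean(), X, dims = 2)

# kmedoids works on the distance matrix rather than on the raw coordinates.
result = kmedoids(D, 2)

labels = assignments(result)  # cluster index for each observation
medoid_idx = result.medoids   # indices of the observations chosen as cluster centers
```

Unlike k-means centroids, the medoids are actual observations, so each cluster center here is a real point from the input. Note that the distance matrix has n×n entries, which grows quickly for a dataset with tens of thousands of rows.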

🟤Hierarchical clustering
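Hierarchical clustering can also start from a pairwise distance matrix. The sketch below assumes the Clustering and Distances packages and uses made-up points; single linkage is picked only for illustration.

```julia
using Clustering, Distances

# Made-up 2-D points (one observation per column), two well-separated groups.
X = [-122.0 -122.1 -121.9 -118.0 -118.1 -117.9;
       37.0   37.1   36.9   34.0   34.1   33.9]

# Pairwise Euclidean distances between observations.
D = pairwise(Euclidean(), X, dims = 2)

# Build the merge tree, then cut it so that exactly k clusters remain.
tree = hclust(D, linkage = :single)
labels = cutree(tree, k = 2)
```

The tree is built once, and cutting it at a different k gives a different number of clusters without rerunning the clustering.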

🟤DBSCAN

Finally...

After finishing this notebook, you should be able to:

🥳 One cool finding

Prices in California do not seem to map exactly onto geographical locations. Specifically, running clustering algorithms on the houses dataset did not reveal a mapping between the clusters and the price ranges. This indicates that the relationship between price and geographical location is not necessarily based on neighborhood, but probably on other factors such as closeness to the water or closeness to a downtown.
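One simple way to check such a finding numerically, rather than by eye, is to count how many houses fall into each pair of cluster label and price bucket. The sketch below uses a hypothetical helper named crosstab and made-up labels; it is not the analysis performed in this notebook.

```julia
# Hypothetical helper: count houses per (cluster, price bucket) pair.
function crosstab(clusters, buckets)
    counts = Dict{Tuple{Int,Int},Int}()
    for (c, b) in zip(clusters, buckets)
        counts[(c, b)] = get(counts, (c, b), 0) + 1
    end
    return counts
end

clusters = [1, 1, 1, 2, 2, 2]  # made-up cluster labels
buckets  = [3, 7, 3, 3, 7, 9]  # made-up price bucket indices
table = crosstab(clusters, buckets)
```

If clusters matched price ranges, each cluster's counts would concentrate in one or two buckets. Counts spread across many buckets, as in this made-up example, would point the other way.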